Validity
F. Heidary
The codes and guidelines all place the concept of validity at the center of the testing enterprise. It is the concept of validity that guides our work in testing and assessment.

What is validity? Until 1989 the same definition had been echoed down the decades. This version is taken from Ruch (1924: 13):

"By validity is meant the degree to which a test or examination measures what it purports to measure. Validity might also be expressed more simply as the ‘worthwhileness’ of an examination. For an examination to possess validity it is necessary that the materials actually included be of prime importance, that the questions sample widely among the essentials over which complete mastery can reasonably be expected on the part of the pupils, and that proof can be brought forward that the test elements (questions) can be defended by arguments based on more than mere personal opinion."

With this traditional definition, the key validity question has always been: ‘Does my test measure what I think it does?’ If the evidence suggests that it does, the responsibility of the test developer is at an end.

However, since Messick’s (1989) work, our understanding of validity has changed. It is now seen as a single concept with a number of different facets, or aspects. The notion of consequential validity extends the possible responsibility of the test developer to all uses of the test. It raises the question of the extent to which the score is relevant and useful to any decisions that might be made on the basis of scores, and whether the use of the test to make those decisions has positive consequences for test takers. The question of relevance and usefulness relates to whether it can be shown that the inferences we draw from a test score about the knowledge, skills and abilities of a test taker are justified. This is the substantive aspect of validity, which replaces the traditional definition quoted above.
Next is the structural aspect, which is closely related to the substantive aspect. If we claim that a test provides information on a number of different skills or abilities, it should be structured and scored according to the skills and abilities of interest. Thirdly, the content of the test should be reasonably representative of the content of a course of study, or of a particular domain (such as ‘aviation English’ or ‘travel Spanish’) in which we are interested. We often wish the test score to be meaningful beyond the immediate questions or tasks on a particular test, as we cannot put all content, situations and tasks on any test; it would simply be too long. So the fourth aspect is generalizability of score meaning beyond the test itself, or whether it is predictive of ability in contexts beyond those modeled in the test. Finally, there is the external aspect, or the relationship of the scores on the test to other measures of the same, or different, skills and abilities. We would hope that tests of a particular skill would provide similar results. Convergence gives us more confidence in the test outcomes.

Our interest in validity is all about trying to build tests for which there is a strong link between inferences and decisions, and ensuring that test use has a positive impact on people and institutions. Whether the test is for use in the classroom, or for large-scale administration, we need a convincing argument that it is useful for its purpose (Kane, 2006).

People engaged in language testing do believe that tests can be used to make fair decisions, and that classroom assessment can inform teaching and learning. Yet we could easily fall into a counsel of despair when we see how tests of all kinds have been used in society. The practice of testing is itself a social construct. It is a practice invented by humankind to make difficult decisions, and to shape educational practices and institutions.
Testing has been used to achieve goals of control and manipulation, and it has been used to provide opportunities to those who would otherwise have none. Like all social constructs, it can be used for good or ill.

Reliability is not enough; a test must also be valid for its use. If test scores are to be used to make accurate inferences about an examinee's ability, they must be both reliable and valid. Reliability is a prerequisite for validity and refers to the ability of a test to measure a particular trait or skill consistently. However, a test can be highly reliable and still not be valid for a particular purpose. There are several ways to estimate the validity of a test, including construct, content, internal, conclusion, external, criterion-related, and face validity.

Construct validity
Construct validity occurs when the theoretical constructs of cause and effect accurately represent the real-world situations they are intended to model. This is related to how well the experiment is operationalized. A good experiment turns the theory (constructs) into actual things you can measure. Sometimes just finding out more about the construct (which itself must be valid) can be helpful. Construct validity is thus an assessment of the quality of an instrument or experimental design. It asks: ‘Does it measure the construct it is supposed to measure?’ If you do not have construct validity, you will likely draw incorrect conclusions from the experiment (garbage in, garbage out).

Convergent validity: Convergent validity occurs where measures of constructs that are expected to correlate do so. This is similar to concurrent validity (which looks for correlation with other tests).

Discriminant validity: Discriminant validity occurs where constructs that are expected not to relate do not, such that it is possible to discriminate between these constructs. Convergence and discrimination are often demonstrated by correlation of the measures used within constructs.
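The correlational check described above can be illustrated with a minimal sketch. All scores and variable names below are hypothetical, invented only for illustration, and the Pearson coefficient is just one common choice of correlation measure; the source does not prescribe a specific statistic.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for eight test takers (not real data).
reading_test_a = [12, 15, 9, 20, 17, 11, 14, 18]    # measure of the construct
reading_test_b = [13, 16, 10, 19, 18, 10, 15, 17]   # second measure of the same construct
unrelated_scores = [41, 38, 40, 39, 42, 38, 41, 40]  # measure of an unrelated attribute

# Convergent evidence: two measures of the same construct should correlate highly.
convergent_r = pearson(reading_test_a, reading_test_b)

# Discriminant evidence: measures of unrelated constructs should correlate weakly.
discriminant_r = pearson(reading_test_a, unrelated_scores)

print(f"convergent r   = {convergent_r:.2f}")
print(f"discriminant r = {discriminant_r:.2f}")
```

With these invented scores the first coefficient comes out at about 0.96 and the second at about 0.07, which is the pattern one would hope to see. How high or low a coefficient must be to count as evidence is a judgment that depends on the testing context.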
Convergent validity and discriminant validity together demonstrate construct validity.

Content validity
Content validity occurs when the experiment provides adequate coverage of the subject being studied. This includes measuring the right things as well as having an adequate sample. Samples should be both large enough and drawn from appropriate target groups. The perfect question gives a complete measure of all aspects of what is being investigated. In practice, however, this is seldom achievable; for example, a simple addition problem does not test the whole of mathematical ability.

Internal validity
Internal validity occurs when it can be concluded that there is a causal relationship between the variables being studied. A danger is that changes might be caused by other factors. It is related to the design of the experiment, such as the use of random assignment of treatments.

Conclusion validity
Conclusion validity occurs when you can conclude that there is a relationship of some kind between the two variables being examined. This may be a positive or a negative correlation.

External validity
External validity occurs when the causal relationship discovered can be generalized to other people, times and contexts. Correct sampling will allow generalization and hence give external validity.

Criterion-related validity
This examines the ability of the measure to predict a variable that is designated as a criterion. A criterion may well be an externally defined 'gold standard'. Achieving this level of validity thus makes results more credible. Criterion-related validity is related to external validity.

Predictive validity: This measures the extent to which a future level of a variable can be predicted from a current measurement. This includes correlation with measurements made with different instruments. For example, a political poll intends to measure future voting intent. College entry tests should have a high predictive validity with regard to final exam results.
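As a rough illustration of predictive validity, the sketch below fits a least-squares line that predicts a later criterion (final exam results) from an earlier measurement (entry test scores) and reports how much of the criterion's variance the line accounts for. The data, the names `entry_test` and `final_exam`, and the choice of simple linear regression are all assumptions made for the example, not a method taken from the source.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for predicting y from x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(x, y, slope, intercept):
    """Share of the variance in y accounted for by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1 - ss_res / ss_tot


# Hypothetical scores for five students (not real data).
entry_test = [50, 60, 70, 80, 90]   # current measurement
final_exam = [52, 61, 65, 79, 88]   # criterion observed later

slope, intercept = fit_line(entry_test, final_exam)
r2 = r_squared(entry_test, final_exam, slope, intercept)

print(f"predicted final = {slope:.2f} * entry + {intercept:.2f}")
print(f"variance explained = {r2:.2f}")
```

A value of `r2` close to 1 would suggest that the entry test predicts the later result well for this sample. A real validation study would need far more test takers and would have to consider whether the sample is appropriate for generalization.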
Concurrent validity: This measures the relationship with measures made with existing tests. The existing tests are thus the criterion. For example, a measure of creativity should correlate with existing measures of creativity.

Face validity
Face validity occurs where something appears to be valid. This of course depends very much on the judgment of the observer. In any case, it is never sufficient and requires more solid validity to enable acceptable conclusions to be drawn. Measures often start out with face validity as the researcher selects those which seem likely to prove the point.

The validity of a test is critical because, without sufficient validity, test scores have no meaning. The evidence you collect and document about the validity of your test is also your best legal defense should the exam program ever be challenged in a court of law. While there are several ways to estimate validity, for many certification and licensure exam programs the most important type of validity to establish is content validity.

Contemporary validity theory has developed procedures for supporting the rationality of decisions based on tests and has thus addressed issues of test fairness. However, although validity theory has again begun to develop ways of thinking about the social dimensions of test use, many issues are still unresolved. In fact, it almost feels as if the ongoing effort to incorporate the social in this latter sense goes against the grain of much validity theory, which remains heavily marked by its origins in the individualist and cognitively oriented field of psychology.

References
Bachman, L. F. (1990). Fundamental Considerations in Language Testing. Oxford University Press.
Fulcher, G. (2010). Practical Language Testing. Hodder Education.